In this study, we investigate the interplay between air quality and climate in India, focusing on key pollutants and meteorological patterns. Drawing on datasets that span several years and cities, our analysis aims to characterize the distribution of air pollutants such as ozone, particulate matter, and nitrogen dioxide, and their correlation with climatic variables such as temperature and rainfall. Through this exploration, we seek to understand the dynamics of environmental quality and its implications for health and well-being. The project’s core objective is to provide a nuanced understanding of environmental conditions in India, offering insights valuable for policy development and environmental management.
The objectives of this mini-project encompass a comprehensive analysis of air quality, climatic trends, and rainfall prediction across various cities in India. This endeavor is multi-faceted, aiming not only to assess and compare current conditions but also to predict future trends. The objectives can be broadly categorized as follows:
| Dataset Name | Description | Time Period | Coverage | Frequency | Parameters |
|---|---|---|---|---|---|
| City Level Data | Data for major cities in India | 2015 to 2020 | 18 major cities in India | Daily and hourly | PM2.5, PM10, NO2, SO2, CO, O3, AQI |
| Station Level Data | Localized air quality measurements at various stations within cities | - | Multiple stations within cities | Hourly and daily | Similar to city level data |
| Dataset Name | Description | Time Period | Coverage | Frequency | Parameters |
|---|---|---|---|---|---|
| General Weather Data | Weather data for major Indian cities | 1990 to 2022 | 8 major cities in India | Daily | Min, max, average temperatures, precipitation |
| File Name | Description | Time Period | Coverage | Frequency | Parameters |
|---|---|---|---|---|---|
| weather_Rourkela.csv | Weather data for Rourkela | 2021 to 2022 | Rourkela | Daily | Temperature, precipitation |
| weather_Bhubaneshwar.csv | Weather data for Bhubaneshwar | 1990 to 2022 | Bhubaneshwar | Hourly | Temperature, precipitation |
| Rajasthan_1990_2022.csv | Weather data for Jodhpur | 1990 to 2022 | Jodhpur | Daily | Temperature, precipitation |
| Mumbai_1990_2022_Santacruz.csv | Weather data for Santacruz (Mumbai) | 1990 to 2022 | Santacruz, Mumbai | Daily | Temperature, precipitation |
| Lucknow_1990_2022.csv | Weather data for Lucknow | 1990 to 2022 | Lucknow | Hourly | Temperature, precipitation |
| Delhi_NCR_1990_2022_Safdarjung.csv | Weather data for Safdarjung (Delhi) | 1990 to 2022 | Safdarjung, Delhi | Daily | Temperature, precipitation |
| Chennai_1990_2022_Madras.csv | Weather data for Chennai | 1990 to 2022 | Chennai | Daily | Temperature, precipitation |
| Bangalore_1990_2022_BangaloreCity.csv | Weather data for Bangalore | 1990 to 2022 | Bangalore | Hourly | Temperature, precipitation |
| Station_GeoLocation_Longitude_Latitude_Elevation | Geographical characteristics of stations | - | Stations in various cities | - | Longitude, latitude, elevation |
This objective aims to provide a holistic understanding of air quality, climatic conditions, and rainfall patterns in India, drawing on comprehensive data analysis, comparative studies, and predictive modeling. The insights gained will be crucial for informing environmental policies, urban planning, and public health strategies.
In this section, we delve into the preliminary analysis of both the climate and air quality datasets. This includes our initial findings, data cleaning steps, and basic data explorations. The analysis begins with climate data, followed by air quality data, to provide a comprehensive overview.
This preliminary analysis sets the stage for more in-depth investigations into both climate and air quality datasets, laying the groundwork for further statistical analysis, comparative studies, and predictive modeling.
# Loading the datasets
# Bangalore
bangalore_df <- read.csv("d1/Bangalore_1990_2022_BangaloreCity.csv")
# Chennai
chennai_df <- read.csv("d1/Chennai_1990_2022_Madras.csv")
# Delhi
delhi_df <- read.csv("d1/Delhi_NCR_1990_2022_Safdarjung.csv")
# Lucknow
lucknow_df <- read.csv("d1/Lucknow_1990_2022.csv")
# Mumbai
mumbai_df <- read.csv("d1/Mumbai_1990_2022_Santacruz.csv")
# Rajasthan (Jodhpur)
rajasthan_df <- read.csv("d1/Rajasthan_1990_2022_Jodhpur.csv")
# Bhubaneswar
bhubaneswar_df <- read.csv("d1/weather_Bhubhneshwar_1990_2022.csv")
# Rourkela
rourkela_df <- read.csv("d1/weather_Rourkela_2021_2022.csv")
# Load the Station GeoLocation data
station_geo_df <- read.csv("d1/Station_GeoLocation_Longitute_Latitude_Elevation_EPSG_4326.csv")
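The eight per-city `read.csv()` calls above can optionally be collapsed into one pass. This is an equivalent sketch, not a required change; the paths are the same ones used above (including their original spellings).

```r
# Optional: load all city weather files in one pass.
# Paths mirror the read.csv() calls above.
city_files <- c(
  bangalore   = "d1/Bangalore_1990_2022_BangaloreCity.csv",
  chennai     = "d1/Chennai_1990_2022_Madras.csv",
  delhi       = "d1/Delhi_NCR_1990_2022_Safdarjung.csv",
  lucknow     = "d1/Lucknow_1990_2022.csv",
  mumbai      = "d1/Mumbai_1990_2022_Santacruz.csv",
  rajasthan   = "d1/Rajasthan_1990_2022_Jodhpur.csv",
  bhubaneswar = "d1/weather_Bhubhneshwar_1990_2022.csv",
  rourkela    = "d1/weather_Rourkela_2021_2022.csv"
)
weather_dfs <- lapply(city_files, read.csv)  # named list of data frames
```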
# Preprocessing the Bangalore dataset
library(dplyr)
# Convert date column to Date format
bangalore_df$time <- as.Date(bangalore_df$time, format = "%d-%m-%Y")
# Filter data for 2015-2020
bangalore_df <- bangalore_df %>%
filter(format(time, "%Y") >= "2015" & format(time, "%Y") <= "2020")
# Chennai
chennai_df$time <- as.Date(chennai_df$time, format = "%d-%m-%Y")
chennai_df <- chennai_df %>%
filter(format(time, "%Y") >= "2015" & format(time, "%Y") <= "2020")
# Delhi
delhi_df$time <- as.Date(delhi_df$time, format = "%d-%m-%Y")
delhi_df <- delhi_df %>%
filter(format(time, "%Y") >= "2015" & format(time, "%Y") <= "2020")
# Lucknow
lucknow_df$time <- as.Date(lucknow_df$time, format = "%d-%m-%Y")
lucknow_df <- lucknow_df %>%
filter(format(time, "%Y") >= "2015" & format(time, "%Y") <= "2020")
# Mumbai
mumbai_df$time <- as.Date(mumbai_df$time, format = "%d-%m-%Y")
mumbai_df <- mumbai_df %>%
filter(format(time, "%Y") >= "2015" & format(time, "%Y") <= "2020")
# Rajasthan (Jodhpur)
rajasthan_df$time <- as.Date(rajasthan_df$time, format = "%d-%m-%Y")
rajasthan_df <- rajasthan_df %>%
filter(format(time, "%Y") >= "2015" & format(time, "%Y") <= "2020")
# Bhubaneswar
bhubaneswar_df$time <- as.Date(bhubaneswar_df$time, format = "%d-%m-%Y")
bhubaneswar_df <- bhubaneswar_df %>%
filter(format(time, "%Y") >= "2015" & format(time, "%Y") <= "2020")
# Rourkela
rourkela_df$time <- as.Date(rourkela_df$time, format = "%d-%m-%Y")
rourkela_df <- rourkela_df %>%
filter(format(time, "%Y") >= "2015" & format(time, "%Y") <= "2020")
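The date conversion and 2015-2020 filter are repeated verbatim for all eight cities; a small helper keeps the logic in one place. A sketch, assuming every file stores its dates in a `time` column in `%d-%m-%Y` format as above:

```r
library(dplyr)

# Parse the date column and keep only 2015-2020,
# mirroring the per-city blocks above
prep_weather <- function(df) {
  df$time <- as.Date(df$time, format = "%d-%m-%Y")
  df %>% filter(format(time, "%Y") >= "2015",
                format(time, "%Y") <= "2020")
}

bangalore_df <- prep_weather(bangalore_df)  # and likewise for the other cities
```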
# Handling missing values in Bangalore dataset
bangalore_df <- bangalore_df %>%
mutate(tavg = ifelse(is.na(tavg), mean(tavg, na.rm = TRUE), tavg),
tmin = ifelse(is.na(tmin), mean(tmin, na.rm = TRUE), tmin),
tmax = ifelse(is.na(tmax), mean(tmax, na.rm = TRUE), tmax),
prcp = ifelse(is.na(prcp), mean(prcp, na.rm = TRUE), prcp))
# Chennai
chennai_df <- chennai_df %>%
mutate(tavg = ifelse(is.na(tavg), mean(tavg, na.rm = TRUE), tavg),
tmin = ifelse(is.na(tmin), mean(tmin, na.rm = TRUE), tmin),
tmax = ifelse(is.na(tmax), mean(tmax, na.rm = TRUE), tmax),
prcp = ifelse(is.na(prcp), mean(prcp, na.rm = TRUE), prcp))
# Delhi
delhi_df <- delhi_df %>%
mutate(tavg = ifelse(is.na(tavg), mean(tavg, na.rm = TRUE), tavg),
tmin = ifelse(is.na(tmin), mean(tmin, na.rm = TRUE), tmin),
tmax = ifelse(is.na(tmax), mean(tmax, na.rm = TRUE), tmax),
prcp = ifelse(is.na(prcp), mean(prcp, na.rm = TRUE), prcp))
# Lucknow
lucknow_df <- lucknow_df %>%
mutate(tavg = ifelse(is.na(tavg), mean(tavg, na.rm = TRUE), tavg),
tmin = ifelse(is.na(tmin), mean(tmin, na.rm = TRUE), tmin),
tmax = ifelse(is.na(tmax), mean(tmax, na.rm = TRUE), tmax),
prcp = ifelse(is.na(prcp), mean(prcp, na.rm = TRUE), prcp))
# Mumbai
mumbai_df <- mumbai_df %>%
mutate(tavg = ifelse(is.na(tavg), mean(tavg, na.rm = TRUE), tavg),
tmin = ifelse(is.na(tmin), mean(tmin, na.rm = TRUE), tmin),
tmax = ifelse(is.na(tmax), mean(tmax, na.rm = TRUE), tmax),
prcp = ifelse(is.na(prcp), mean(prcp, na.rm = TRUE), prcp))
# Rajasthan (Jodhpur)
rajasthan_df <- rajasthan_df %>%
mutate(tavg = ifelse(is.na(tavg), mean(tavg, na.rm = TRUE), tavg),
tmin = ifelse(is.na(tmin), mean(tmin, na.rm = TRUE), tmin),
tmax = ifelse(is.na(tmax), mean(tmax, na.rm = TRUE), tmax),
prcp = ifelse(is.na(prcp), mean(prcp, na.rm = TRUE), prcp))
# Bhubaneswar
bhubaneswar_df <- bhubaneswar_df %>%
mutate(tavg = ifelse(is.na(tavg), mean(tavg, na.rm = TRUE), tavg),
tmin = ifelse(is.na(tmin), mean(tmin, na.rm = TRUE), tmin),
tmax = ifelse(is.na(tmax), mean(tmax, na.rm = TRUE), tmax),
prcp = ifelse(is.na(prcp), mean(prcp, na.rm = TRUE), prcp))
# Rourkela
rourkela_df <- rourkela_df %>%
mutate(tavg = ifelse(is.na(tavg), mean(tavg, na.rm = TRUE), tavg),
tmin = ifelse(is.na(tmin), mean(tmin, na.rm = TRUE), tmin),
tmax = ifelse(is.na(tmax), mean(tmax, na.rm = TRUE), tmax),
prcp = ifelse(is.na(prcp), mean(prcp, na.rm = TRUE), prcp))
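The same mean-imputation block is applied to every city; `dplyr::across()` expresses it once. One caveat worth noting: mean-imputing `prcp` inflates the number of apparently wet days (most days have zero rainfall), so median imputation may be preferable. The sketch below simply mirrors the approach used above.

```r
library(dplyr)

# Replace NAs in the four weather columns with the column mean,
# mirroring the per-column ifelse() calls above
impute_means <- function(df) {
  df %>% mutate(across(c(tavg, tmin, tmax, prcp),
                       ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))
}

bangalore_df <- impute_means(bangalore_df)  # and likewise for the other cities
```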
# Summary statistics for Bangalore
summary_stats_bangalore <- bangalore_df %>%
summarise(Mean_Tavg = mean(tavg, na.rm = TRUE),
Median_Tavg = median(tavg, na.rm = TRUE),
SD_Tavg = sd(tavg, na.rm = TRUE),
Mean_Prcp = mean(prcp, na.rm = TRUE),
Median_Prcp = median(prcp, na.rm = TRUE),
SD_Prcp = sd(prcp, na.rm = TRUE))
summary_stats_bangalore
## Mean_Tavg Median_Tavg SD_Tavg Mean_Prcp Median_Prcp SD_Prcp
## 1 24.18695 23.8 2.226738 5.930541 5.930541 8.966999
# Chennai
summary_stats_chennai <- chennai_df %>%
summarise(Mean_Tavg = mean(tavg, na.rm = TRUE),
Median_Tavg = median(tavg, na.rm = TRUE),
SD_Tavg = sd(tavg, na.rm = TRUE),
Mean_Prcp = mean(prcp, na.rm = TRUE),
Median_Prcp = median(prcp, na.rm = TRUE),
SD_Prcp = sd(prcp, na.rm = TRUE))
# Delhi
summary_stats_delhi <- delhi_df %>%
summarise(Mean_Tavg = mean(tavg, na.rm = TRUE),
Median_Tavg = median(tavg, na.rm = TRUE),
SD_Tavg = sd(tavg, na.rm = TRUE),
Mean_Prcp = mean(prcp, na.rm = TRUE),
Median_Prcp = median(prcp, na.rm = TRUE),
SD_Prcp = sd(prcp, na.rm = TRUE))
# Lucknow
summary_stats_lucknow <- lucknow_df %>%
summarise(Mean_Tavg = mean(tavg, na.rm = TRUE),
Median_Tavg = median(tavg, na.rm = TRUE),
SD_Tavg = sd(tavg, na.rm = TRUE),
Mean_Prcp = mean(prcp, na.rm = TRUE),
Median_Prcp = median(prcp, na.rm = TRUE),
SD_Prcp = sd(prcp, na.rm = TRUE))
# Mumbai
summary_stats_mumbai <- mumbai_df %>%
summarise(Mean_Tavg = mean(tavg, na.rm = TRUE),
Median_Tavg = median(tavg, na.rm = TRUE),
SD_Tavg = sd(tavg, na.rm = TRUE),
Mean_Prcp = mean(prcp, na.rm = TRUE),
Median_Prcp = median(prcp, na.rm = TRUE),
SD_Prcp = sd(prcp, na.rm = TRUE))
# Rajasthan (Jodhpur)
summary_stats_rajasthan <- rajasthan_df %>%
summarise(Mean_Tavg = mean(tavg, na.rm = TRUE),
Median_Tavg = median(tavg, na.rm = TRUE),
SD_Tavg = sd(tavg, na.rm = TRUE),
Mean_Prcp = mean(prcp, na.rm = TRUE),
Median_Prcp = median(prcp, na.rm = TRUE),
SD_Prcp = sd(prcp, na.rm = TRUE))
# Bhubaneswar
summary_stats_bhubaneswar <- bhubaneswar_df %>%
summarise(Mean_Tavg = mean(tavg, na.rm = TRUE),
Median_Tavg = median(tavg, na.rm = TRUE),
SD_Tavg = sd(tavg, na.rm = TRUE),
Mean_Prcp = mean(prcp, na.rm = TRUE),
Median_Prcp = median(prcp, na.rm = TRUE),
SD_Prcp = sd(prcp, na.rm = TRUE))
# Rourkela
summary_stats_rourkela <- rourkela_df %>%
summarise(Mean_Tavg = mean(tavg, na.rm = TRUE),
Median_Tavg = median(tavg, na.rm = TRUE),
SD_Tavg = sd(tavg, na.rm = TRUE),
Mean_Prcp = mean(prcp, na.rm = TRUE),
Median_Prcp = median(prcp, na.rm = TRUE),
SD_Prcp = sd(prcp, na.rm = TRUE))
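The eight identical summary blocks can also be generated as one table, which makes cross-city comparison easier. A sketch, assuming the city data frames prepared above:

```r
library(dplyr)

# Same statistics as the per-city blocks above, labeled by city
summarise_weather <- function(df, city) {
  df %>% summarise(City        = city,
                   Mean_Tavg   = mean(tavg, na.rm = TRUE),
                   Median_Tavg = median(tavg, na.rm = TRUE),
                   SD_Tavg     = sd(tavg, na.rm = TRUE),
                   Mean_Prcp   = mean(prcp, na.rm = TRUE),
                   Median_Prcp = median(prcp, na.rm = TRUE),
                   SD_Prcp     = sd(prcp, na.rm = TRUE))
}

summary_stats_all <- bind_rows(
  summarise_weather(bangalore_df, "Bangalore"),
  summarise_weather(chennai_df,   "Chennai"),
  summarise_weather(delhi_df,     "Delhi")
  # ... remaining cities
)
```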
# Preliminary exploration of the stations metadata
# Convert StationId, City, State, and Status into factors
aqi_stnts$StationId <- as_factor(aqi_stnts$StationId)
aqi_stnts$City <- as_factor(aqi_stnts$City)
aqi_stnts$State <- as_factor(aqi_stnts$State)
aqi_stnts$Status <- as_factor(aqi_stnts$Status)
str(aqi_stnts$StationId)
## Factor w/ 230 levels "AP001","AP002",..: 1 2 3 4 5 6 7 8 9 10 ...
str(aqi_stnts$City)
## Factor w/ 127 levels "Amaravati","Rajamahendravaram",..: 1 2 3 4 5 6 7 7 8 9 ...
str(aqi_stnts$State)
## Factor w/ 21 levels "Andhra Pradesh",..: 1 1 1 1 1 2 3 3 3 3 ...
aqi_stnts %>% nrow()
## [1] 230
aqi_stnts %>% summary()
## StationId StationName City State
## AP001 : 1 Length:230 Delhi : 38 Delhi :38
## AP002 : 1 Class :character Bengaluru: 10 Haryana :29
## AP003 : 1 Mode :character Mumbai : 10 Uttar Pradesh :26
## AP004 : 1 Kolkata : 7 Maharashtra :22
## AP005 : 1 Patna : 6 Karnataka :20
## AS001 : 1 Hyderabad: 6 Madhya Pradesh:16
## (Other):224 (Other) :153 (Other) :79
## Status
## Active :131
## Inactive: 2
## NA's : 97
##
##
##
##
# Preliminary exploration of city_day data
# Convert the city and AQI_Bucket into factors
aqi_city_day$City <-as_factor(aqi_city_day$City)
aqi_city_day$AQI_Bucket <-as_factor(aqi_city_day$AQI_Bucket)
aqi_city_day%>%nrow()
## [1] 29531
str(aqi_city_day$City)
## Factor w/ 26 levels "Ahmedabad","Aizawl",..: 1 1 1 1 1 1 1 1 1 1 ...
str(aqi_city_day$AQI_Bucket)
## Factor w/ 6 levels "Poor","Very Poor",..: NA NA NA NA NA NA NA NA NA NA ...
aqi_city_day%>%summary()
## City Date PM2.5 PM10
## Ahmedabad: 2009 Min. :2015-01-01 Min. : 0.04 Min. : 0.01
## Bengaluru: 2009 1st Qu.:2017-04-16 1st Qu.: 28.82 1st Qu.: 56.26
## Chennai : 2009 Median :2018-08-05 Median : 48.57 Median : 95.68
## Delhi : 2009 Mean :2018-05-14 Mean : 67.45 Mean : 118.13
## Lucknow : 2009 3rd Qu.:2019-09-03 3rd Qu.: 80.59 3rd Qu.: 149.75
## Mumbai : 2009 Max. :2020-07-01 Max. :949.99 Max. :1000.00
## (Other) :17477 NA's :4598 NA's :11140
## NO NO2 NOx NH3
## Min. : 0.02 Min. : 0.01 Min. : 0.00 Min. : 0.01
## 1st Qu.: 5.63 1st Qu.: 11.75 1st Qu.: 12.82 1st Qu.: 8.58
## Median : 9.89 Median : 21.69 Median : 23.52 Median : 15.85
## Mean : 17.57 Mean : 28.56 Mean : 32.31 Mean : 23.48
## 3rd Qu.: 19.95 3rd Qu.: 37.62 3rd Qu.: 40.13 3rd Qu.: 30.02
## Max. :390.68 Max. :362.21 Max. :467.63 Max. :352.89
## NA's :3582 NA's :3585 NA's :4185 NA's :10328
## CO SO2 O3 Benzene
## Min. : 0.000 Min. : 0.01 Min. : 0.01 Min. : 0.000
## 1st Qu.: 0.510 1st Qu.: 5.67 1st Qu.: 18.86 1st Qu.: 0.120
## Median : 0.890 Median : 9.16 Median : 30.84 Median : 1.070
## Mean : 2.249 Mean : 14.53 Mean : 34.49 Mean : 3.281
## 3rd Qu.: 1.450 3rd Qu.: 15.22 3rd Qu.: 45.57 3rd Qu.: 3.080
## Max. :175.810 Max. :193.86 Max. :257.73 Max. :455.030
## NA's :2059 NA's :3854 NA's :4022 NA's :5623
## Toluene Xylene AQI AQI_Bucket
## Min. : 0.000 Min. : 0.00 Min. : 13.0 Poor :2781
## 1st Qu.: 0.600 1st Qu.: 0.14 1st Qu.: 81.0 Very Poor :2337
## Median : 2.970 Median : 0.98 Median : 118.0 Severe :1338
## Mean : 8.701 Mean : 3.07 Mean : 166.5 Moderate :8829
## 3rd Qu.: 9.150 3rd Qu.: 3.35 3rd Qu.: 208.0 Satisfactory:8224
## Max. :454.850 Max. :170.37 Max. :2049.0 Good :1341
## NA's :8041 NA's :18109 NA's :4681 NA's :4681
# Preliminary exploration of city_hour data
# Convert the city and AQI_Bucket into factors
aqi_city_hour$City <-as_factor(aqi_city_hour$City)
aqi_city_hour$AQI_Bucket <-as_factor(aqi_city_hour$AQI_Bucket)
str(aqi_city_hour$City)
## Factor w/ 26 levels "Ahmedabad","Aizawl",..: 1 1 1 1 1 1 1 1 1 1 ...
str(aqi_city_hour$AQI_Bucket)
## Factor w/ 6 levels "Poor","Moderate",..: NA NA NA NA NA NA NA NA NA NA ...
aqi_city_hour%>%nrow()
## [1] 707875
aqi_city_hour%>%summary()
## City Datetime PM2.5
## Ahmedabad: 48192 Min. :2015-01-01 01:00:00.00 Min. : 0.01
## Bengaluru: 48192 1st Qu.:2017-04-15 23:00:00.00 1st Qu.: 26.20
## Chennai : 48192 Median :2018-08-04 20:00:00.00 Median : 46.42
## Delhi : 48192 Mean :2018-05-14 02:41:03.45 Mean : 67.62
## Lucknow : 48192 3rd Qu.:2019-09-02 14:00:00.00 3rd Qu.: 79.49
## Mumbai : 48192 Max. :2020-07-01 00:00:00.00 Max. : 999.99
## (Other) :418723 NA's :145088
## PM10 NO NO2 NOx
## Min. : 0.01 Min. : 0.01 Min. : 0.01 Min. : 0.00
## 1st Qu.: 52.38 1st Qu.: 3.84 1st Qu.: 10.81 1st Qu.: 10.66
## Median : 91.50 Median : 7.96 Median : 20.32 Median : 20.79
## Mean : 119.08 Mean : 17.42 Mean : 28.89 Mean : 32.29
## 3rd Qu.: 147.52 3rd Qu.: 16.15 3rd Qu.: 36.35 3rd Qu.: 37.15
## Max. :1000.00 Max. :499.99 Max. :499.51 Max. :498.61
## NA's :296737 NA's :116632 NA's :117122 NA's :123224
## NH3 CO SO2 O3
## Min. : 0.01 Min. : 0.00 Min. : 0.01 Min. : 0.01
## 1st Qu.: 8.12 1st Qu.: 0.42 1st Qu.: 4.88 1st Qu.: 13.42
## Median : 15.38 Median : 0.80 Median : 8.37 Median : 26.24
## Mean : 23.61 Mean : 2.18 Mean : 14.04 Mean : 34.80
## 3rd Qu.: 29.23 3rd Qu.: 1.37 3rd Qu.: 14.78 3rd Qu.: 47.62
## Max. :499.97 Max. :498.57 Max. :199.96 Max. :497.62
## NA's :272542 NA's :86517 NA's :130373 NA's :129208
## Benzene Toluene Xylene AQI
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 8.0
## 1st Qu.: 0.05 1st Qu.: 0.37 1st Qu.: 0.1 1st Qu.: 79.0
## Median : 0.86 Median : 2.59 Median : 0.8 Median : 116.0
## Mean : 3.09 Mean : 8.66 Mean : 3.1 Mean : 166.4
## 3rd Qu.: 2.75 3rd Qu.: 8.41 3rd Qu.: 3.1 3rd Qu.: 208.0
## Max. :498.07 Max. :499.40 Max. :500.0 Max. :3133.0
## NA's :163646 NA's :220607 NA's :455829 NA's :129080
## AQI_Bucket
## Poor : 66654
## Moderate :198991
## Very Poor : 57455
## Severe : 27650
## Satisfactory:189434
## Good : 38611
## NA's :129080
# Preliminary exploration of station_day data
# Convert StationId and AQI_Bucket into factors
aqi_stnt_day$StationId <-as_factor(aqi_stnt_day$StationId)
aqi_stnt_day$AQI_Bucket <-as_factor(aqi_stnt_day$AQI_Bucket)
str(aqi_stnt_day$StationId)
## Factor w/ 110 levels "AP001","AP005",..: 1 1 1 1 1 1 1 1 1 1 ...
str(aqi_stnt_day$AQI_Bucket)
## Factor w/ 6 levels "Moderate","Poor",..: NA 1 1 1 1 1 1 1 1 2 ...
aqi_stnt_day%>%nrow()
## [1] 108035
aqi_stnt_day%>%summary()
## StationId Date PM2.5 PM10
## DL007 : 2009 Min. :2015-01-01 Min. : 0.02 Min. : 0.01
## DL008 : 2009 1st Qu.:2017-10-14 1st Qu.: 31.88 1st Qu.: 70.15
## DL013 : 2009 Median :2018-12-02 Median : 55.95 Median : 122.09
## DL021 : 2009 Mean :2018-08-17 Mean : 80.27 Mean : 157.97
## DL033 : 2009 3rd Qu.:2019-10-16 3rd Qu.: 99.92 3rd Qu.: 208.67
## GJ001 : 2009 Max. :2020-07-01 Max. :1000.00 Max. :1000.00
## (Other):95981 NA's :21625 NA's :42706
## NO NO2 NOx NH3
## Min. : 0.01 Min. : 0.01 Min. : 0.00 Min. : 0.01
## 1st Qu.: 4.84 1st Qu.: 15.09 1st Qu.: 13.97 1st Qu.: 11.90
## Median : 10.29 Median : 27.21 Median : 26.66 Median : 23.59
## Mean : 23.12 Mean : 35.24 Mean : 41.20 Mean : 28.73
## 3rd Qu.: 24.98 3rd Qu.: 46.93 3rd Qu.: 50.50 3rd Qu.: 38.14
## Max. :470.00 Max. :448.05 Max. :467.63 Max. :418.90
## NA's :17106 NA's :16547 NA's :15500 NA's :48105
## CO SO2 O3 Benzene
## Min. : 0.000 Min. : 0.01 Min. : 0.01 Min. : 0.000
## 1st Qu.: 0.530 1st Qu.: 5.04 1st Qu.: 18.89 1st Qu.: 0.160
## Median : 0.910 Median : 8.95 Median : 30.84 Median : 1.210
## Mean : 1.606 Mean : 12.26 Mean : 38.13 Mean : 3.358
## 3rd Qu.: 1.450 3rd Qu.: 14.92 3rd Qu.: 47.14 3rd Qu.: 3.610
## Max. :175.810 Max. :195.65 Max. :963.00 Max. :455.030
## NA's :12998 NA's :25204 NA's :25568 NA's :31455
## Toluene Xylene AQI AQI_Bucket
## Min. : 0.00 Min. : 0.00 Min. : 8.0 Moderate :29417
## 1st Qu.: 0.69 1st Qu.: 0.00 1st Qu.: 86.0 Poor :11493
## Median : 4.33 Median : 0.40 Median : 132.0 Very Poor :11762
## Mean : 15.35 Mean : 2.42 Mean : 179.7 Satisfactory:23636
## 3rd Qu.: 17.51 3rd Qu.: 2.11 3rd Qu.: 254.0 Good : 5510
## Max. :454.85 Max. :170.37 Max. :2049.0 Severe : 5207
## NA's :38702 NA's :85137 NA's :21010 NA's :21010
# Preliminary exploration of station_hour data
# Convert StationId and AQI_Bucket into factors
aqi_stnt_hour$StationId <-as_factor(aqi_stnt_hour$StationId)
aqi_stnt_hour$AQI_Bucket <-as_factor(aqi_stnt_hour$AQI_Bucket)
str(aqi_stnt_hour$StationId)
## Factor w/ 110 levels "AP001","AP005",..: 1 1 1 1 1 1 1 1 1 1 ...
str(aqi_stnt_hour$AQI_Bucket)
## Factor w/ 6 levels "Moderate","Poor",..: NA NA NA NA NA NA NA NA NA NA ...
aqi_stnt_hour%>%nrow()
## [1] 2589083
aqi_stnt_hour%>%summary()
## StationId Datetime PM2.5
## DL007 : 48192 Min. :2015-01-01 01:00:00.00 Min. : 0.0
## DL008 : 48192 1st Qu.:2017-10-13 20:00:00.00 1st Qu.: 28.2
## DL013 : 48192 Median :2018-12-02 06:00:00.00 Median : 52.6
## DL021 : 48192 Mean :2018-08-17 09:52:35.77 Mean : 80.9
## DL033 : 48192 3rd Qu.:2019-10-15 06:00:00.00 3rd Qu.: 97.7
## GJ001 : 48192 Max. :2020-07-01 00:00:00.00 Max. :1000.0
## (Other):2299931 NA's :647689
## PM10 NO NO2 NOx
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 64.0 1st Qu.: 3.0 1st Qu.: 13.1 1st Qu.: 11.3
## Median : 116.2 Median : 7.2 Median : 24.8 Median : 22.9
## Mean : 158.5 Mean : 22.8 Mean : 35.2 Mean : 40.6
## 3rd Qu.: 204.0 3rd Qu.: 18.6 3rd Qu.: 45.5 3rd Qu.: 45.7
## Max. :1000.0 Max. :500.0 Max. :500.0 Max. :500.0
## NA's :1119252 NA's :553711 NA's :528973 NA's :490808
## NH3 CO SO2 O3
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 11.2 1st Qu.: 0.4 1st Qu.: 4.2 1st Qu.: 11.0
## Median : 22.4 Median : 0.8 Median : 8.2 Median : 24.8
## Mean : 28.7 Mean : 1.5 Mean : 12.1 Mean : 38.1
## 3rd Qu.: 37.8 3rd Qu.: 1.4 3rd Qu.: 14.5 3rd Qu.: 49.5
## Max. :500.0 Max. :498.6 Max. :200.0 Max. :997.0
## NA's :1236618 NA's :499302 NA's :742737 NA's :725973
## Benzene Toluene Xylene AQI
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 5.0
## 1st Qu.: 0.1 1st Qu.: 0.3 1st Qu.: 0.0 1st Qu.: 84.0
## Median : 1.0 Median : 3.4 Median : 0.2 Median : 131.0
## Mean : 3.3 Mean : 14.9 Mean : 2.4 Mean : 180.2
## 3rd Qu.: 3.2 3rd Qu.: 15.1 3rd Qu.: 1.8 3rd Qu.: 259.0
## Max. :498.1 Max. :500.0 Max. :500.0 Max. :3133.0
## NA's :861579 NA's :1042366 NA's :2075104 NA's :570190
## AQI_Bucket
## Moderate :675008
## Poor :239990
## Very Poor :301150
## Satisfactory:530164
## Good :152113
## Severe :120468
## NA's :570190
Several observations emerge from this preliminary exploration. AQI_Bucket is categorized into six levels (Good, Satisfactory, Moderate, Poor, Very Poor, and Severe) across the city_day, city_hour, station_day, and station_hour datasets. The city_day and city_hour datasets cover the same 26 cities, and the station_day and station_hour datasets are consistent with 110 station IDs each. Wherever AQI is missing, AQI_Bucket is also missing (their NA counts match in every summary), so no further cleaning is needed for the interrelation of these two fields. The stations table lists 230 stations, more than the 110 present in station_day and station_hour, so those datasets cover only a subset of stations. Subsequent analysis will draw primarily on the city_day and station_day datasets, focusing on air quality trends, their temporal and spatial dynamics, and the influence of environmental factors.
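The observation that AQI and AQI_Bucket are missing together can be checked directly rather than read off the NA counts in the summaries; if it holds, each expression below returns 0:

```r
# Rows where exactly one of AQI / AQI_Bucket is missing (expected: 0)
sum(is.na(aqi_city_day$AQI) != is.na(aqi_city_day$AQI_Bucket))
sum(is.na(aqi_stnt_day$AQI) != is.na(aqi_stnt_day$AQI_Bucket))
```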
In this section, we focus on preprocessing and analyzing the Temperature & Precipitation dataset to discern climatic trends across various Indian cities. Key steps include merging geographic data (latitude, longitude, elevation) and consolidating data from individual cities into one comprehensive dataframe. We enhance the dataset with temporal features (month, year), geographical classifications (Coastal/Non-Coastal regions based on elevation), and seasonal categories (Summer, Winter, Rainy). Additionally, we identify day types (weekdays/weekends) and integrate city information, transforming the City field into a factor.
The processing extends to merging each city’s dataset with geolocation data, followed by combining these into a single merged_weather dataframe. We classify cities into Coastal or Non-Coastal regions and clean the data by removing rows with missing values in key columns. For a more granular analysis, we compute monthly averages of temperature and precipitation across different years.
Our analysis includes visualizations and trend examinations of annual precipitation and temperature across cities. We observe general trends such as increasing precipitation since 2004 and rising temperatures, particularly in Delhi and Lucknow. The analysis also reveals distinct climatic differences between Coastal and Non-Coastal cities, with Coastal regions exhibiting higher temperatures and precipitation levels. This exploration highlights the geographical influence on climate patterns and the significant variation in temperature and precipitation across regions.
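The feature engineering described above can be sketched as follows. The `Elevation` column name, the 50 m coastal cut-off, and the season month ranges are illustrative assumptions; the source does not state the exact thresholds used.

```r
library(dplyr)

merged_weather <- merged_weather %>%
  mutate(Month   = as.integer(format(time, "%m")),
         Year    = as.integer(format(time, "%Y")),
         # weekdays() is locale-dependent; assumes an English locale
         Weekday = ifelse(weekdays(time) %in% c("Saturday", "Sunday"),
                          "Weekend", "Weekday"),
         # Illustrative thresholds -- not taken from the source
         Region  = ifelse(Elevation < 50, "Coastal", "Non-Coastal"),
         Season  = case_when(Month %in% 3:5 ~ "Summer",
                             Month %in% 6:9 ~ "Rainy",
                             TRUE           ~ "Winter"),
         City    = factor(City))
```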
library(ggplot2)
# Function to calculate annual trends
calculate_annual_trends <- function(df) {
df %>%
group_by(Year = format(time, "%Y")) %>%
summarise(Mean_Tavg = mean(tavg, na.rm = TRUE),
Total_Prcp = sum(prcp, na.rm = TRUE))
}
annual_trends_bangalore <- calculate_annual_trends(bangalore_df)
# Temperature Trend Plot for Bangalore
ggplot(annual_trends_bangalore, aes(x = Year, y = Mean_Tavg, group = 1)) +
geom_point() +
geom_line() +
labs(title = "Annual Mean Temperature Trend in Bangalore (2015-2020)",
x = "Year",
y = "Mean Temperature (°C)") +
theme_minimal()
# Precipitation Trend Plot for Bangalore
ggplot(annual_trends_bangalore, aes(x = Year, y = Total_Prcp, group = 1)) +
geom_point() +
geom_line() +
labs(title = "Annual Total Precipitation Trend in Bangalore (2015-2020)",
x = "Year",
y = "Total Precipitation (mm)") +
theme_minimal()
# Annual trends for Chennai
annual_trends_chennai <- calculate_annual_trends(chennai_df)
# Temperature Trend Plot for Chennai
ggplot(annual_trends_chennai, aes(x = Year, y = Mean_Tavg, group = 1)) +
geom_point() +
geom_line() +
labs(title = "Annual Mean Temperature Trend in Chennai (2015-2020)",
x = "Year",
y = "Mean Temperature (°C)") +
theme_minimal()
# Precipitation Trend Plot for Chennai
ggplot(annual_trends_chennai, aes(x = Year, y = Total_Prcp, group = 1)) +
geom_point() +
geom_line() +
labs(title = "Annual Total Precipitation Trend in Chennai (2015-2020)",
x = "Year",
y = "Total Precipitation (mm)") +
theme_minimal()
# Annual trends for Delhi
annual_trends_delhi <- calculate_annual_trends(delhi_df)
# Temperature Trend Plot for Delhi
ggplot(annual_trends_delhi, aes(x = Year, y = Mean_Tavg, group = 1)) +
geom_point() +
geom_line() +
labs(title = "Annual Mean Temperature Trend in Delhi (2015-2020)",
x = "Year",
y = "Mean Temperature (°C)") +
theme_minimal()
# Precipitation Trend Plot for Delhi
ggplot(annual_trends_delhi, aes(x = Year, y = Total_Prcp, group = 1)) +
geom_point() +
geom_line() +
labs(title = "Annual Total Precipitation Trend in Delhi (2015-2020)",
x = "Year",
y = "Total Precipitation (mm)") +
theme_minimal()
# Annual trends for Lucknow
annual_trends_lucknow <- calculate_annual_trends(lucknow_df)
# Temperature Trend Plot for Lucknow
ggplot(annual_trends_lucknow, aes(x = Year, y = Mean_Tavg, group = 1)) +
geom_point() +
geom_line() +
labs(title = "Annual Mean Temperature Trend in Lucknow (2015-2020)",
x = "Year",
y = "Mean Temperature (°C)") +
theme_minimal()
# Precipitation Trend Plot for Lucknow
ggplot(annual_trends_lucknow, aes(x = Year, y = Total_Prcp, group = 1)) +
geom_point() +
geom_line() +
labs(title = "Annual Total Precipitation Trend in Lucknow (2015-2020)",
x = "Year",
y = "Total Precipitation (mm)") +
theme_minimal()
# Annual trends for Mumbai
annual_trends_mumbai <- calculate_annual_trends(mumbai_df)
# Temperature Trend Plot for Mumbai
ggplot(annual_trends_mumbai, aes(x = Year, y = Mean_Tavg, group = 1)) +
geom_point() +
geom_line() +
labs(title = "Annual Mean Temperature Trend in Mumbai (2015-2020)",
x = "Year",
y = "Mean Temperature (°C)") +
theme_minimal()
# Precipitation Trend Plot for Mumbai
ggplot(annual_trends_mumbai, aes(x = Year, y = Total_Prcp, group = 1)) +
geom_point() +
geom_line() +
labs(title = "Annual Total Precipitation Trend in Mumbai (2015-2020)",
x = "Year",
y = "Total Precipitation (mm)") +
theme_minimal()
# Annual trends for Rajasthan
annual_trends_rajasthan <- rajasthan_df %>%
group_by(Year = format(time, "%Y")) %>%
summarise(Mean_Tavg = mean(tavg, na.rm = TRUE),
Total_Prcp = sum(prcp, na.rm = TRUE))
# Temperature Trend Plot for Rajasthan
ggplot(annual_trends_rajasthan, aes(x = Year, y = Mean_Tavg, group = 1)) +
geom_point() +
geom_line() +
labs(title = "Annual Mean Temperature Trend in Rajasthan (2015-2020)",
x = "Year",
y = "Mean Temperature (°C)") +
theme_minimal()
# Precipitation Trend Plot for Rajasthan
ggplot(annual_trends_rajasthan, aes(x = Year, y = Total_Prcp, group = 1)) +
geom_point() +
geom_line() +
labs(title = "Annual Total Precipitation Trend in Rajasthan (2015-2020)",
x = "Year",
y = "Total Precipitation (mm)") +
theme_minimal()
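The per-city blocks above repeat the same summarise-and-plot steps for every city. A small helper could remove that duplication (a sketch; `annual_trends` and `plot_city_trends` are hypothetical names, and each city data frame is assumed to have `time` (Date), `tavg`, and `prcp` columns, as in this report). Using a numeric Year also lets `geom_line()` connect the points without an explicit group aesthetic.

```r
library(dplyr)
library(ggplot2)

# Compute annual mean temperature and total precipitation for one city's data frame
annual_trends <- function(df) {
  df %>%
    group_by(Year = as.integer(format(time, "%Y"))) %>%
    summarise(Mean_Tavg = mean(tavg, na.rm = TRUE),
              Total_Prcp = sum(prcp, na.rm = TRUE),
              .groups = "drop")
}

# Build the two trend plots for a city and return them in a named list
plot_city_trends <- function(df, city) {
  trends <- annual_trends(df)
  list(
    temperature = ggplot(trends, aes(x = Year, y = Mean_Tavg)) +
      geom_point() + geom_line() +
      labs(title = paste("Annual Mean Temperature Trend in", city),
           x = "Year", y = "Mean Temperature (°C)") +
      theme_minimal(),
    precipitation = ggplot(trends, aes(x = Year, y = Total_Prcp)) +
      geom_point() + geom_line() +
      labs(title = paste("Annual Total Precipitation Trend in", city),
           x = "Year", y = "Total Precipitation (mm)") +
      theme_minimal()
  )
}
```

With this, `plot_city_trends(delhi_df, "Delhi")$temperature` would reproduce the Delhi temperature plot above, and the same call works for any other city.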
# Select the needed columns and add a City label to each dataset
bangalore_df <- bangalore_df %>% select(time, tavg, tmin, tmax, prcp) %>% mutate(City = "Bangalore")
chennai_df <- chennai_df %>% select(time, tavg, tmin, tmax, prcp) %>% mutate(City = "Chennai")
delhi_df <- delhi_df %>% select(time, tavg, tmin, tmax, prcp) %>% mutate(City = "Delhi")
lucknow_df <- lucknow_df %>% select(time, tavg, tmin, tmax, prcp) %>% mutate(City = "Lucknow")
mumbai_df <- mumbai_df %>% select(time, tavg, tmin, tmax, prcp) %>% mutate(City = "Mumbai")
rajasthan_df <- rajasthan_df %>% select(time, tavg, tmin, tmax, prcp) %>% mutate(City = "Rajasthan")
bhubaneswar_df <- bhubaneswar_df %>% select(time, tavg, tmin, tmax, prcp) %>% mutate(City = "Bhubaneswar")
rourkela_df <- rourkela_df %>% select(time, tavg, tmin, tmax, prcp) %>% mutate(City = "Rourkela")
# Combine all city datasets into one dataframe
all_cities_df <- rbind(bangalore_df, chennai_df, delhi_df, lucknow_df, mumbai_df, rajasthan_df, bhubaneswar_df, rourkela_df)
# Boxplot for average temperatures
ggplot(all_cities_df, aes(x = City, y = tavg, fill = City)) +
geom_boxplot() +
labs(title = "Comparison of Average Temperatures Across Cities",
x = "City",
y = "Average Temperature (°C)") +
theme_minimal() +
theme(legend.position = "none")
## Warning: Removed 78 rows containing non-finite values (`stat_boxplot()`).
# Boxplot for minimum temperatures
ggplot(all_cities_df, aes(x = City, y = tmin, fill = City)) +
geom_boxplot() +
labs(title = "Comparison of Minimum Temperatures Across Cities",
x = "City",
y = "Minimum Temperature (°C)") +
theme_minimal() +
theme(legend.position = "none")
## Warning: Removed 2090 rows containing non-finite values (`stat_boxplot()`).
# Boxplot for maximum temperatures
ggplot(all_cities_df, aes(x = City, y = tmax, fill = City)) +
geom_boxplot() +
labs(title = "Comparison of Maximum Temperatures Across Cities",
x = "City",
y = "Maximum Temperature (°C)") +
theme_minimal() +
theme(legend.position = "none")
## Warning: Removed 891 rows containing non-finite values (`stat_boxplot()`).
# Creating boxplots for precipitation data
ggplot(all_cities_df, aes(x = City, y = prcp, fill = City)) +
geom_boxplot() +
labs(title = "Comparison of Precipitation Among Cities (2015-2020)",
x = "City",
y = "Precipitation (mm)") +
theme_minimal() +
theme(legend.position = "none")
## Warning: Removed 5097 rows containing non-finite values (`stat_boxplot()`).
# Calculate annual mean temperatures for each city
annual_mean_temps <- all_cities_df %>%
group_by(City, Year = format(time, "%Y")) %>%
summarise(Mean_Tavg = mean(tavg, na.rm = TRUE))
## `summarise()` has grouped output by 'City'. You can override using the
## `.groups` argument.
# Plotting the temperature trends
ggplot(annual_mean_temps, aes(x = Year, y = Mean_Tavg, group = City, color = City)) +
geom_line() +
labs(title = "Year-Over-Year Mean Temperature Trends Across Cities (2015-2020)",
x = "Year",
y = "Mean Temperature (°C)") +
theme_minimal() +
theme(legend.position = "bottom")
# Plotting temperature trends with improved axis readability
ggplot(annual_mean_temps, aes(x = Year, y = Mean_Tavg, group = City, color = City)) +
geom_line() +
labs(title = "Year-Over-Year Mean Temperature Trends Across Cities (2015-2020)",
x = "Year",
y = "Mean Temperature (°C)") +
theme_minimal() +
theme(legend.position = "bottom") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Adjusting x-axis labels
# Calculate total annual precipitation for each city
annual_precipitation <- all_cities_df %>%
group_by(City, Year = format(time, "%Y")) %>%
summarise(Total_Prcp = sum(prcp, na.rm = TRUE))
## `summarise()` has grouped output by 'City'. You can override using the
## `.groups` argument.
# Base plot with line plot for temperatures
p <- ggplot() +
geom_line(data = annual_mean_temps, aes(x = Year, y = Mean_Tavg, group = City, color = City)) +
labs(title = "Annual Mean Temperature and Total Precipitation Trends (2015-2020)",
x = "Year",
y = "Mean Temperature (°C)")
# Adding the bar plot for precipitation
# Precipitation is scaled down (here by 50, an eyeballed factor) so the bars fit on the temperature axis; the secondary axis reverses the scaling
p + geom_bar(data = annual_precipitation, aes(x = Year, y = Total_Prcp / 50, fill = City), stat = "identity", position = "dodge", alpha = 0.5) +
scale_y_continuous(sec.axis = sec_axis(~ . * 50, name = "Total Precipitation (mm)"))
library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)
# Function to assign seasons to months
get_season <- function(month) {
case_when(
month %in% c(3, 4, 5) ~ "Spring",
month %in% c(6, 7, 8) ~ "Summer",
month %in% c(9, 10, 11) ~ "Autumn",
month %in% c(12, 1, 2) ~ "Winter"
)
}
# Adding a Season column to the dataset
all_cities_df <- all_cities_df %>%
mutate(Month = month(time),
Season = get_season(Month))
# Calculating seasonal mean temperature and total precipitation
seasonal_stats <- all_cities_df %>%
group_by(City, Season) %>%
summarise(Mean_Tavg = mean(tavg, na.rm = TRUE),
Total_Prcp = sum(prcp, na.rm = TRUE), .groups = 'drop')
# Plotting Seasonal Temperature Variations
ggplot(seasonal_stats, aes(x = Season, y = Mean_Tavg, fill = City)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Seasonal Mean Temperature Variations Across Cities",
x = "Season",
y = "Mean Temperature (°C)") +
theme_minimal() +
theme(legend.position = "bottom")
# Plotting Seasonal Precipitation Variations
ggplot(seasonal_stats, aes(x = Season, y = Total_Prcp, fill = City)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Seasonal Total Precipitation Variations Across Cities",
x = "Season",
y = "Total Precipitation (mm)") +
theme_minimal() +
theme(legend.position = "bottom")
library(dplyr)
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(DT)
# Calculating annual mean temperature and total precipitation for each city
annual_climate_data <- all_cities_df %>%
group_by(City, Year = format(time, "%Y")) %>%
summarise(Mean_Tavg = mean(tavg, na.rm = TRUE),
Total_Prcp = sum(prcp, na.rm = TRUE), .groups = 'drop')
# Enhanced Correlation Plot
ggpairs(annual_climate_data, columns = c("Mean_Tavg", "Total_Prcp"), ggplot2::aes(colour = City)) +
labs(title = "Enhanced Correlation Matrix Between Mean Temperature and Total Precipitation Across Cities")
# Interactive Table of Correlations
annual_climate_data %>%
group_by(City) %>%
summarise(Correlation = cor(Mean_Tavg, Total_Prcp, use = "complete.obs")) %>%
datatable(options = list(pageLength = 10))
First, define what constitutes an “extreme” event; the definition may vary by city and by event type (temperature, rainfall, etc.).
For example, a day with a temperature above the 95th percentile can be treated as extremely hot, and a day with rainfall above the 95th percentile as a heavy-rainfall day.
# Define thresholds for extreme events
temperature_threshold <- quantile(all_cities_df$tavg, 0.95, na.rm = TRUE)
rainfall_threshold <- quantile(all_cities_df$prcp, 0.95, na.rm = TRUE)
# Identify extreme temperature events
all_cities_df$extreme_temp <- all_cities_df$tavg > temperature_threshold
# Identify extreme rainfall events
all_cities_df$extreme_rain <- all_cities_df$prcp > rainfall_threshold
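Note that the thresholds above are pooled across all cities, so a uniformly hot or wet city contributes most of the flagged days. If the goal is city-relative extremes, per-city thresholds are an alternative (a sketch with dplyr; `flag_extremes_by_city` is a hypothetical helper name):

```r
library(dplyr)

# Flag extremes relative to each city's own 95th percentile,
# rather than one threshold pooled over all cities
flag_extremes_by_city <- function(df) {
  df %>%
    group_by(City) %>%
    mutate(extreme_temp = tavg > quantile(tavg, 0.95, na.rm = TRUE),
           extreme_rain = prcp > quantile(prcp, 0.95, na.rm = TRUE)) %>%
    ungroup()
}
```

Applied as `all_cities_df <- flag_extremes_by_city(all_cities_df)`, this would feed the same yearly and monthly analyses below, but each city would then contribute roughly 5% of its own days as extremes.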
# Analyze extreme temperature events
extreme_temp_analysis <- all_cities_df %>%
group_by(Year = format(time, "%Y"), City) %>%
summarise(Extreme_Temp_Days = sum(extreme_temp, na.rm = TRUE))
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
# Analyze extreme rainfall events
extreme_rain_analysis <- all_cities_df %>%
group_by(Year = format(time, "%Y"), City) %>%
summarise(Extreme_Rain_Days = sum(extreme_rain, na.rm = TRUE))
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
# Plotting extreme temperature trends
ggplot(extreme_temp_analysis, aes(x = Year, y = Extreme_Temp_Days, group = City, color = City)) +
geom_line() +
labs(title = "Yearly Trends of Extreme Temperature Days",
x = "Year",
y = "Number of Extreme Temperature Days")
# Plotting extreme rainfall trends
ggplot(extreme_rain_analysis, aes(x = Year, y = Extreme_Rain_Days, group = City, color = City)) +
geom_line() +
labs(title = "Yearly Trends of Extreme Rainfall Days",
x = "Year",
y = "Number of Extreme Rainfall Days")
library(dplyr)
library(lubridate)
# Extract the month from the date
all_cities_df$Month <- format(as.Date(all_cities_df$time), "%m")
all_cities_df$Month <- as.integer(all_cities_df$Month)
# Monthly analysis of extreme temperature events
monthly_extreme_temp <- all_cities_df %>%
group_by(Month, City) %>%
summarise(Extreme_Temp_Days = sum(extreme_temp, na.rm = TRUE))
## `summarise()` has grouped output by 'Month'. You can override using the
## `.groups` argument.
# Monthly analysis of extreme rainfall events
monthly_extreme_rain <- all_cities_df %>%
group_by(Month, City) %>%
summarise(Extreme_Rain_Days = sum(extreme_rain, na.rm = TRUE))
## `summarise()` has grouped output by 'Month'. You can override using the
## `.groups` argument.
# Plotting monthly extreme temperature trends
ggplot(monthly_extreme_temp, aes(x = Month, y = Extreme_Temp_Days, fill = City)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Monthly Distribution of Extreme Temperature Days",
x = "Month",
y = "Number of Extreme Temperature Days") +
scale_x_continuous(breaks = 1:12, labels = month.abb)
# Plotting monthly extreme rainfall trends
ggplot(monthly_extreme_rain, aes(x = Month, y = Extreme_Rain_Days, fill = City)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Monthly Distribution of Extreme Rainfall Days",
x = "Month",
y = "Number of Extreme Rainfall Days") +
scale_x_continuous(breaks = 1:12, labels = month.abb)
# Prepare the data for long-term trend analysis
# Ensure all datasets have a Year column
# Add Year column to each city's dataset
bangalore_df$Year <- format(bangalore_df$time, "%Y")
chennai_df$Year <- format(chennai_df$time, "%Y")
delhi_df$Year <- format(delhi_df$time, "%Y")
lucknow_df$Year <- format(lucknow_df$time, "%Y")
mumbai_df$Year <- format(mumbai_df$time, "%Y")
rajasthan_df$Year <- format(rajasthan_df$time, "%Y")
bhubaneswar_df$Year <- format(bhubaneswar_df$time, "%Y")
rourkela_df$Year <- format(rourkela_df$time, "%Y")
# Combine all datasets into one dataframe
all_cities_df <- rbind(
bangalore_df %>% mutate(City = "Bangalore"),
chennai_df %>% mutate(City = "Chennai"),
delhi_df %>% mutate(City = "Delhi"),
lucknow_df %>% mutate(City = "Lucknow"),
mumbai_df %>% mutate(City = "Mumbai"),
rajasthan_df %>% mutate(City = "Rajasthan"),
bhubaneswar_df %>% mutate(City = "Bhubaneswar"),
rourkela_df %>% mutate(City = "Rourkela")
)
all_cities_long_term <- all_cities_df %>%
mutate(Year = as.numeric(Year),
Decade = case_when(
Year >= 1990 & Year < 2000 ~ "1990s",
Year >= 2000 & Year < 2010 ~ "2000s",
Year >= 2010 & Year <= 2022 ~ "2010s" # 2020-2022 is folded into this bucket to avoid a short partial-decade bar
))
# Decadal temperature trends
decadal_temp_trends <- all_cities_long_term %>%
group_by(City, Decade) %>%
summarise(Mean_Tavg = mean(tavg, na.rm = TRUE))
## `summarise()` has grouped output by 'City'. You can override using the
## `.groups` argument.
# Decadal precipitation trends
decadal_precip_trends <- all_cities_long_term %>%
group_by(City, Decade) %>%
summarise(Total_Prcp = sum(prcp, na.rm = TRUE))
## `summarise()` has grouped output by 'City'. You can override using the
## `.groups` argument.
# Temperature trend plot
ggplot(decadal_temp_trends, aes(x = Decade, y = Mean_Tavg, fill = City)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Decadal Average Temperature Trends",
x = "Decade",
y = "Mean Temperature (°C)")
# Precipitation trend plot
ggplot(decadal_precip_trends, aes(x = Decade, y = Total_Prcp, fill = City)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Decadal Total Precipitation Trends",
x = "Decade",
y = "Total Precipitation (mm)")
In this phase, we analyze temperature and precipitation trends to identify long-term climatic changes. We scrutinize average temperatures across decades for any upward or downward trends, with an increasing trend possibly indicating global warming. Similarly, we examine precipitation patterns over the years to detect any shifts in rainfall. This analysis spans different cities, accounting for their unique geographic and climatic characteristics. It’s important to note, however, that these trends suggest potential changes but don’t confirm causation, as climate change is driven by various complex factors. Through this method, we aim to capture an overarching view of how climate parameters have shifted over the past three decades, shedding light on broader trends in climate change.
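Bar heights alone make "upward or downward" hard to judge by eye. A least-squares slope per city puts a number on the trend (a sketch; `decadal_temp_slope` is a hypothetical helper that assumes a data frame shaped like `all_cities_long_term` above, with `City`, a numeric `Year`, and `tavg`):

```r
library(dplyr)

# Fit a straight line to each city's annual mean temperature; the slope
# (x10) approximates a warming rate in °C per decade. With only a few
# dozen annual points the confidence intervals are wide, so read the
# sign of the slope as suggestive, not conclusive.
decadal_temp_slope <- function(df) {
  df %>%
    group_by(City, Year) %>%
    summarise(Mean_Tavg = mean(tavg, na.rm = TRUE), .groups = "drop") %>%
    group_by(City) %>%
    summarise(Slope_C_per_decade = 10 * coef(lm(Mean_Tavg ~ Year))[2],
              .groups = "drop")
}
```

For example, `decadal_temp_slope(all_cities_long_term)` would return one per-decade rate per city, which can be compared directly across cities.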
# Example using Bangalore's data
# Creating new features based on the date
bangalore_df <- bangalore_df %>%
mutate(Month = format(time, "%m"),
Day = format(time, "%d"),
DayOfYear = yday(time))
# Removing the original 'time' column
bangalore_df <- select(bangalore_df, -time)
# Splitting the data into training and testing sets
# Assuming 80% training, 20% testing split
set.seed(123) # For reproducibility
training_indices <- sample(1:nrow(bangalore_df), 0.8 * nrow(bangalore_df))
train_data <- bangalore_df[training_indices, ]
test_data <- bangalore_df[-training_indices, ]
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
# Using Random Forest for rainfall prediction
rf_model <- randomForest(prcp ~ ., data = train_data)
# Summarizing the model
print(rf_model)
##
## Call:
## randomForest(formula = prcp ~ ., data = train_data)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 80.49238
## % Var explained: 9.4
# Making predictions on the test data
predictions <- predict(rf_model, test_data)
# Using Mean Absolute Error (MAE) for evaluation
mae <- mean(abs(predictions - test_data$prcp))
print(paste("Mean Absolute Error: ", mae))
## [1] "Mean Absolute Error: 3.79870041260058"
Mean of Squared Residuals: about 80.49, the average squared difference between the observed rainfall values and the model’s predictions.
Variance Explained: the model accounts for only about 9.4% of the variance in the rainfall data — quite low, suggesting it is missing much of the structure in daily rainfall.
Mean Absolute Error: an MAE of 3.80 means the predictions are, on average, about 3.80 units (presumably millimeters) away from the actual daily rainfall.
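MAE alone can hide occasional large misses. A small base-R helper (a sketch; `eval_metrics` is a hypothetical name) adds RMSE and an out-of-sample R², computed from the `predictions` and `test_data` objects above, for comparison with the model's in-bag "% Var explained":

```r
# Out-of-sample error metrics: RMSE penalizes large misses more than
# MAE, and R² compares the model against always predicting the test
# set's mean rainfall (it can go negative for a poor model).
eval_metrics <- function(actual, predicted) {
  ok <- !is.na(actual) & !is.na(predicted)   # drop pairs with missing values
  actual <- actual[ok]
  predicted <- predicted[ok]
  rmse <- sqrt(mean((predicted - actual)^2))
  r2 <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
  c(RMSE = rmse, R2 = r2)
}
```

Here it would be called as `eval_metrics(test_data$prcp, predictions)`, and the same helper applies unchanged to the decision-tree predictions below.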
library(rpart)
# Building the decision tree model
dt_model <- rpart(prcp ~ ., data = train_data, method = "anova")
# Printing the model
print(dt_model)
## n= 1753
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 1753 155748.900 6.044540
## 2) tmin>=20.75 580 18999.970 4.175578 *
## 3) tmin< 20.75 1173 133721.200 6.968665
## 6) Month=01,02,03,06,07,08,09,10,11,12 1125 110491.400 6.424355
## 12) Month=01,02,03,06,07,08,11,12 902 53517.360 5.626395
## 24) tavg>=19.65 895 48763.280 5.503594
## 48) Day=01,02,03,04,05,06,07,08,09,10,11,13,14,18,19,20,22,23,27,28,29,30,31 654 11950.970 4.783881 *
## 49) Day=12,15,16,17,21,24,25,26 241 35554.250 7.456676
## 98) Month=01,02,03,06,07,11,12 208 15600.370 6.475991 *
## 99) Month=08 33 18492.960 13.637960
## 198) Year=2016,2018,2019,2020 21 1633.101 5.788623 *
## 199) Year=2015,2017 12 13301.760 27.374300 *
## 25) tavg< 19.65 7 3014.946 21.327370 *
## 13) Month=09,10 223 54076.570 9.651976
## 26) Year=2015,2016,2018,2019,2020 184 24055.390 7.484868
## 52) Day=01,02,03,04,05,06,07,10,12,13,14,15,16,17,19,20,22,23,26,27,29,31 133 5506.941 5.388758 *
## 53) Day=08,09,11,18,21,24,25,28,30 51 16440.170 12.951190
## 106) tavg>=23.25 34 3662.822 6.785612 *
## 107) tavg< 23.25 17 8899.885 25.282350 *
## 27) Year=2017 39 25080.130 19.876280
## 54) Day=05,08,11,12,14,16,17,18,19,20,21,22,23,24,26,28,29,30,31 25 1587.206 6.438995 *
## 55) Day=01,02,04,06,09,10,13,15,25,27 14 10918.170 43.871430 *
## 7) Month=04,05 48 15084.630 19.725920
## 14) Day=01,02,03,08,09,10,11,12,13,14,15,16,18,26,29,30,31 30 2792.180 10.878140 *
## 15) Day=04,17,19,20,21,23,24,27,28 18 6029.796 34.472220 *
# Making predictions on the test data
predictions_dt <- predict(dt_model, test_data)
# Using Mean Absolute Error (MAE) for evaluation
mae_dt <- mean(abs(predictions_dt - test_data$prcp))
print(paste("Mean Absolute Error: ", mae_dt))
## [1] "Mean Absolute Error: 4.27886547420482"
# Additional evaluation metrics - Root Mean Square Error (RMSE)
rmse_dt <- sqrt(mean((predictions_dt - test_data$prcp)^2))
print(paste("Root Mean Square Error: ", rmse_dt))
## [1] "Root Mean Square Error: 7.6573603613743"
library(rpart.plot)
# Plotting the decision tree
rpart.plot(dt_model, main = "Decision Tree for Rainfall Prediction")
# Enhanced plotting of the decision tree
rpart.plot(dt_model,
main = "Decision Tree for Rainfall Prediction",
type = 4, # Enhanced tree type with split labels, variable names, and fitted values
extra = 101, # Display the number of observations in each node
under = TRUE, # Place node labels under the node (instead of inside it)
faclen = 0, # Full factor levels in split labels
cex = 0.6) # Size of text (cex and tweak should not both be set, so tweak is dropped)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
library(leaflet)
# Ensure the column names are correctly referenced
station_geo_df$Latitude <- as.numeric(station_geo_df$Latitude)
station_geo_df$longitude <- as.numeric(station_geo_df$longitude)
# Create a leaflet map
leaflet(station_geo_df) %>%
addTiles() %>%
addMarkers(~longitude, ~Latitude, popup = ~Location_Name)
# Creating a color palette
pal <- colorNumeric(palette = "viridis", domain = station_geo_df$Elevation)
leaflet(station_geo_df) %>%
addTiles() %>%
addCircleMarkers(~longitude, ~Latitude,
popup = ~paste(Location_Name, "Elevation:", Elevation, "m"),
color = ~pal(Elevation), fill = TRUE)
library(leaflet)
library(leaflet.extras)
# Calculating average rainfall for each city
average_rainfall <- all_cities_df %>%
group_by(City) %>%
summarise(Avg_Rainfall = mean(prcp, na.rm = TRUE))
# Merging with station_geo_df to include geographic coordinates
station_geo_df <- merge(station_geo_df, average_rainfall, by.x = "Location_Name", by.y = "City")
average_rainfall
## # A tibble: 7 × 2
## City Avg_Rainfall
## <chr> <dbl>
## 1 Bangalore 5.93
## 2 Bhubaneswar 7.07
## 3 Chennai 11.5
## 4 Delhi 6.18
## 5 Lucknow 8.70
## 6 Mumbai 23.0
## 7 Rajasthan 5.93
leaflet(station_geo_df) %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addHeatmap(lng = ~longitude, lat = ~Latitude, intensity = ~Avg_Rainfall, radius = 20, blur = 15)
This section focuses on the seamless integration of air quality and weather data, enriched with geospatial coordinates and detailed city profiles. By blending these diverse datasets, we create a comprehensive perspective that not only assesses environmental parameters but also considers the geographical context of Indian cities. This fusion enables us to understand the intricate relationship between air quality, weather patterns, and their geographical variations across India.
#Convert the Date to date type
aqi_city_day$Date = as_date(aqi_city_day$Date, format='%Y-%m-%d')
#Extract month and year as additional columns
merged <- aqi_city_day
merged <- merged %>% mutate(Month = month(Date))
merged <- merged %>% mutate(Year = year(Date))
merged <- merged %>% mutate(Day = wday(Date, label=TRUE, abbr=FALSE))
#Import Indian Cities database
indian_cities$City = as_factor(indian_cities$City)
#Merge Lat-Long into aqi_day
merged <- merge(merged, indian_cities%>%select("City", "Lat", "Long"), by="City")
#Introduce new column for partitioning into N/S, where North = Lat > 22.5
merged <-merged %>% mutate(Region = ifelse(Lat>22.5,"North","South"))
#Introduce column for season. Summer:03-06, Rainy:07-10, Winter:11-02
merged <- merged %>% mutate(Season = case_when(Month %in% c(3,4,5,6)~"Summer" , Month %in% c(7,8,9,10) ~"Rainy", Month %in% c(11,12,1,2) ~"Winter" ) )
#Introduce column for partitioning into weekday and weekend. Weekend = Saturday, Sunday; Weekday= others
merged <- merged %>% mutate(DayType = ifelse(Day %in% c("Sunday", "Saturday"), "Weekend", "Weekday"))
yearly_summary <-merged %>% group_by(Year, City) %>% summarise(avg_AQI=mean(AQI, na.rm=TRUE))
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
yearly_summary <- merge(yearly_summary, indian_cities%>%select("City", "Lat", "Long"), by="City")
seasonal_summary <-merged %>% group_by(Year, City, Season) %>% summarise(avg_AQI=mean(AQI, na.rm=TRUE))
## `summarise()` has grouped output by 'Year', 'City'. You can override using the
## `.groups` argument.
seasonal_summary <- merge(seasonal_summary, indian_cities%>%select("City", "Lat", "Long"), by="City")
regional_summary <-merged %>% group_by(Year, City, Region) %>% summarise(avg_AQI=mean(AQI, na.rm=TRUE))
## `summarise()` has grouped output by 'Year', 'City'. You can override using the
## `.groups` argument.
regional_summary <- merge(regional_summary, indian_cities%>%select("City", "Lat", "Long"), by="City")
regional_seasonal_summary <-merged %>% group_by(Year, Region, Season, City) %>% summarise(avg_AQI=mean(AQI, na.rm=TRUE))
## `summarise()` has grouped output by 'Year', 'Region', 'Season'. You can
## override using the `.groups` argument.
regional_seasonal_summary <- merge(regional_seasonal_summary, indian_cities%>%select("City", "Lat", "Long"), by="City")
weektype_summary <-merged %>% group_by(Year, Month, City, DayType) %>% summarise(avg_AQI=mean(AQI, na.rm=TRUE))
## `summarise()` has grouped output by 'Year', 'Month', 'City'. You can override
## using the `.groups` argument.
weektype_summary <- merge(weektype_summary, indian_cities%>%select("City", "Lat", "Long"), by="City")
ggplot(yearly_summary, aes(x=Year, y=avg_AQI, group=City))+geom_line()+facet_wrap(~City, ncol=6)
m_2015<-yearly_summary%>%filter(Year==2015)
m_2016<-yearly_summary%>%filter(Year==2016)
m_2017<-yearly_summary%>%filter(Year==2017)
m_2018<-yearly_summary%>%filter(Year==2018)
m_2019<-yearly_summary%>%filter(Year==2019)
m_2020<-yearly_summary%>%filter(Year==2020)
yearly_summary %>%
leaflet()%>%
addProviderTiles("CartoDB")%>%
addCircleMarkers(data=m_2015,radius=~avg_AQI/10,popup=~City, group="2015")%>%
addCircleMarkers(data=m_2016,radius=~avg_AQI/10,popup=~City, group="2016")%>%
addCircleMarkers(data=m_2017,radius=~avg_AQI/10,popup=~City, group="2017")%>%
addCircleMarkers(data=m_2018,radius=~avg_AQI/10,popup=~City, group="2018")%>%
addCircleMarkers(data=m_2019,radius=~avg_AQI/10,popup=~City, group="2019")%>%
addCircleMarkers(data=m_2020,radius=~avg_AQI/10,popup=~City, group="2020")%>%
addLayersControl(overlayGroups=c("2015","2016","2017","2018","2019","2020"), options=layersControlOptions(collapsed=FALSE))-> map
map <- map %>% hideGroup("2016") %>% hideGroup("2017") %>% hideGroup("2018") %>% hideGroup("2019") %>% hideGroup("2020")
map
ggplot(data = seasonal_summary,aes(x=Year, y=avg_AQI, color=Season))+geom_point()+facet_wrap(~City, ncol=3)
## Warning: Removed 15 rows containing missing values (`geom_point()`).
# Capture the seasonal variation for five representative cities
filtered <- seasonal_summary%>%
filter(City %in% c("Ahmedabad", "Delhi", "Patna", "Mumbai","Bengaluru"))
ggplot(data = filtered,aes(x=Year, y=avg_AQI, color=Season))+
geom_point()+
facet_wrap(~City, ncol=1, scales="free_y")+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
## Warning: Removed 11 rows containing missing values (`geom_point()`).
ggplot(data=t1, aes(x=State, fill=City))+geom_bar()+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1), legend.position = "none")
summary_merged_weather <-merge(summary_merged_weather, geolocations, by.x = "City", by.y="Location_Name")
#Create a region column where Region = Coastal if elevation < 100 m, else Noncoastal
summary_merged_weather <- summary_merged_weather %>% mutate(Region = ifelse(Elevation > 100, "Noncoastal", "Coastal"))
#Plot the annual variation in average temperature for coastal vs. non-coastal cities (1990-1999)
summary_merged_weather %>% filter(Year %in% seq(1990,1999))%>%
group_by(Year, Region, City) %>%
ggplot(aes(x=City, y=t_avg_annual, color=Region)) +
geom_point() +
facet_wrap(~Year, ncol=5)+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
gathered_df%>%filter(City %in% c("Ahmedabad", "Bengaluru", "Mumbai", "Delhi", "Patna")) %>%
filter(!Pollutant %in% c("PM2.5","PM10")) %>%
group_by(City,Year)%>%
ggplot(aes(x=Pollutant, y=Avg, color=Pollutant, na.rm=TRUE))+
geom_col(aes(fill=Pollutant))+
facet_grid(vars(City),vars(Year), scales="free_y")+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
## Warning: Removed 34 rows containing missing values (`position_stack()`).
library(dplyr)
library(ggplot2)
# Calculate the proportion of missing values for each variable
miss_summary <- merged_aqi_weather %>%
summarise(across(everything(), ~sum(is.na(.))/n())) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "MissingProportion")
# Plot the missingness summary for the entire dataset
ggplot(miss_summary, aes(x = Variable, y = MissingProportion)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
# Calculate the proportion of missing values for each variable, grouped by City
miss_summary_city <- merged_aqi_weather %>%
group_by(City) %>%
summarise(across(everything(), ~sum(is.na(.))/n())) %>%
pivot_longer(-City, names_to = "Variable", values_to = "MissingProportion")
# Plot the missingness summary by city
ggplot(miss_summary_city, aes(x = Variable, y = MissingProportion, fill = City)) +
geom_bar(stat = "identity", position = position_dodge()) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
library(statisticalModeling)
# Randomly assign roughly half the rows to training (TRUE) and the rest to testing (FALSE)
merged_aqi_weather$training_cases <- rnorm(nrow(merged_aqi_weather)) > 0
# Build base model AQI ~ tavg + prcp + Month + PM2.5 + PM10 + O3 with training cases
model1 <- lm(AQI ~ tavg + prcp + Month + PM2.5 + PM10 + O3, data = subset(merged_aqi_weather, training_cases))
# Evaluate the model for the testing cases
pred_model1 <- evaluate_model(model1, data = subset(merged_aqi_weather,!training_cases))
# Calculate the MSE on the testing data (an out-of-sample error)
with(data = pred_model1, mean((AQI - pred_model1$model_output)^2)) -> out_of_sample_error1
testing_data<-subset(merged_aqi_weather, !training_cases)
plot_data1<-data.frame(predicted=pred_model1$model_output, actual=testing_data$AQI)
ggplot(data=plot_data1, aes(x=predicted, y=actual))+geom_point()+geom_abline(intercept = 0, slope =1, color ="red")
High AQI in Urban Centers: Cities like Ahmedabad, Delhi, Mumbai, Patna, and Bangalore have consistently high AQI values, indicating a significant level of air pollution.
Temporal Variations in AQI: AQI tends to be higher during winter months due to atmospheric conditions that trap pollutants. Conversely, it decreases during the rainy season when pollutants are washed away by rainfall.
Geographical Disparity in AQI: Northern Indian cities generally exhibit higher AQI values compared to those in the Southern region, likely due to differences in industrial activity, population density, and meteorological factors.
Weekday vs. Weekend AQI: No substantial difference in AQI was observed between weekdays and weekends, suggesting a persistent and consistent level of air pollution irrespective of the weekly cycle.
Stable Temperatures in Some Cities: Chennai, Rajasthan, Bangalore, and Mumbai show only minor fluctuations in average annual temperatures.
Rising Temperatures in Delhi and Lucknow: These cities exhibit a noticeable and consistent increase in temperatures, potentially indicative of urban heat island effects and broader climate change impacts.
Higher Temperatures in Coastal Cities: Mumbai and Chennai, being coastal cities, consistently record higher average temperatures compared to non-coastal cities.
Warming Trend in Non-Coastal Cities: Non-coastal cities, especially Delhi and Lucknow, show a trend of increasing temperatures over recent years.
Increased Rainfall in Recent Decades: Most cities have experienced an increase in average annual precipitation since 2004, which may be linked to changing climate patterns.
Higher Precipitation in Coastal Regions: Coastal regions, particularly Mumbai, receive significantly more rainfall compared to non-coastal areas.
Mumbai’s Exceptional Rainfall: Among coastal cities, Mumbai stands out with substantially higher precipitation levels than Chennai.
Stable Precipitation in Non-Coastal Regions: Non-coastal regions do not show significant changes in average precipitation over the past 30 years.
Predictive Modeling: Models built on these datasets helped untangle the complex interplay among environmental factors such as temperature, precipitation, and air pollutants.
Climate and Air Quality Interactions: The models underscore the impact of climatic factors on air quality, illustrating how weather patterns can influence pollutant dispersion and concentration.
Geospatial Insights: The inclusion of geographical data (latitude, longitude, elevation) provided nuanced insights into regional variations in climate and air quality.
The comprehensive analysis combining air quality, temperature, and precipitation data reveals a multi-faceted picture of environmental conditions across Indian cities. The findings highlight the urgency of addressing urban pollution and the importance of monitoring climate trends to inform policy and urban planning. The observed trends and patterns in these environmental parameters are critical for understanding the broader impacts of urbanization and climate change on public health and ecosystems.
To ensure a comprehensive understanding and accurate interpretation of the data analysis presented in this report, the following sources and references have been utilized: